Abstract

The quality and quantity of articles vary greatly between Wikipedia languages. Translating from another Wikipedia is a natural way to add more content, but the translation process is not properly supported in the software used by Wikipedia. Past computer-assisted translation tools built for Wikipedia are not commonly used. We created a tool that adapts to the specific needs of an open community and to the kind of content in Wikipedia. Qualitative and quantitative data indicate that the new tool helps users translate articles more easily and faster.

1 Introduction 


Wikipedia is the most multilingual encyclopedic knowledge archive, with over 280 languages with varying amounts of content. The knowledge available to a user is limited by the languages used to access it. Translation has been a common way to expand knowledge across languages in Wikipedia. The editing activity of the top 46 language editions of Wikipedia shows that 25% of edits by multilingual users are for the same article in different languages (Hale, 2013).

It is not necessary to use any tool to translate Wikipedia articles. However, it is a complicated process and mainly done by experienced Wikipedia editors.

There were many attempts to build tools to support translation of articles. None has seen widespread use: in our research only a few users reported using those tools when translating Wikipedia articles.

In this paper we present a new approach to support translation which has been designed taking into account the unique needs of Wikipedia content and its community. Content Translation (CX) is a new tool that automates many steps of the translation process and validates the approach in practice. It was first enabled on 8 Wikipedias as an opt-in feature to create new articles in January 2015. Selected language pairs have machine translation (MT) support.

2 Previous work 


MediaWiki, the software powering Wikipedia, is translated to hundreds of languages using the Translate extension. No such solution was available for translating Wikipedia articles, leaving a gap in the translation support.

There were at least ten instances of translation tools built for Wikipedia.¹ Those tools can be divided into two groups based on whether the tool creators already possessed MT software. The first group is composed of companies such as Google and Microsoft, but also smaller companies and researchers. The other group of tools has been created just for Wikipedia article translation, mostly by volunteers.

Among the earliest tools were GTT by Google and WikiBhasha by Microsoft, using their own MT services (Garcia and Stevenson, 2009; Kumaran et al., 2011). Later came Casmacat for professionals and researchers (Alabau et al., 2013) and CoSyne for multilingual MediaWikis (Bronner et al., 2012); Wikipedias, by contrast, are each monolingual.

Common to all such tools is that they are not integrated into Wikipedia. To use them one needs to go to another website or install software. CX is integrated into Wikipedia and provides a WYSIWYG (what you see is what you get) editor.

¹ Details collected at https://meta.wikimedia.org/wiki/Machine_translation


3 Designing the translation experience 

The design of CX was aimed at improving the existing process users followed when translating. Following the principles of User-Centered Design (Norman and Draper, 1986), we organised periodic user research sessions to (a) better understand the user needs during the existing translation process, and (b) validate new ideas on how to improve this process.


3.1 User research

We recruited 106 participants using a survey.² From their responses we identified dictionaries (76% of participants used them) and Wikipedia (60%) as their most used tools when translating. MT (53%), spell checkers (48%) and glossaries (42%) were also common. Fewer than 6% of the participants mentioned tools specifically aimed at Wikipedia article translation, such as those described in Section 2, and no tool was mentioned by more than one participant.

We organised 16 research sessions, each in two parts. In the first part, contextual inquiry techniques (Beyer and Holtzblatt, 1998) were applied to observe user behaviour while translating and to identify user needs. The second part was a usability testing study (Nielsen, 1994) to evaluate different design ideas in the form of prototypes.


3.2 Design principles

The research sessions were instrumental in guiding the design of the translation experience.³ The following design principles summarise the approach we followed when designing the tool.


Freedom of translation

There is significant diversity in Wikipedia content across languages. On average, two articles from different languages on the same topic have just 41% of content in common (Hecht and Gergle, 2010). In contrast to other kinds of content, such as software user interface strings or documentation, Wikipedia articles are not intended to be exact translations that are always kept in sync. In order to support that content diversity, CX does not force users to translate the full article. As illustrated in Figure 1, users add content to the translation one paragraph at a time. When a paragraph is added, an initial translation based on MT is provided for the user to edit. MT is used if available, but the user can also start with the source text or an empty paragraph if that is preferred.

Unlike other tools that define a strong boundary to translate on a per-sentence basis, working at a paragraph level allows users to reorganise sentences and accommodate different editing patterns.

² https://goo.gl/iKQIDh
³ A detailed design specification is available at https://www.mediawiki.org/wiki/CX

Provide context information

In CX the original article and the translation are shown side-by-side. Each paragraph is dynamically aligned vertically with the corresponding translated paragraph, regardless of the difference in length. This allows users to quickly get an overview of what has already been translated and what has not.

Contextual information reduces the need for the user to navigate and reorient. When translating a sentence, the corresponding sentence in the original document is highlighted. In addition, when manipulating the content, options are provided anticipating the user's next steps. In Figure 1, based on the user's text selection, the user can explore the article related to the selected text (in the source or target languages), or turn the selected text into a link. A dictionary can be accessed inside the tool by selecting a word or by using the search box in the tools column.

Focus on the translation 

We identified many steps in the translation process that could be automated. Users spend time making sure each link they translated points to the correct article in the target Wikipedia, and recreating the text formatting that was lost when using an external translation service. They also look for categories available in the target Wikipedia to classify the translated article, and save constantly during the process to avoid losing their work.

Figure 1: The source and translated content side-by-side and additional tools on the right.

CX deals with those aspects automatically. When adding a paragraph, the initial translation preserves the text format. Modifications to the translated content are saved automatically. Links point to the right articles if they exist, and existing categories are added to the article thanks to Wikidata,⁴ a structured data knowledge repository that maps corresponding concepts across languages. As those aspects are automated, users can focus on adapting content for the initial version of the article rather than on technical and formatting tasks.

⁴ https://www.wikidata.org
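The link adaptation described above can be illustrated with a minimal sketch. It assumes a small in-memory table standing in for the cross-language mapping that Wikidata provides; the concept identifiers, the table contents and the function name `adapt_link` are hypothetical and are not taken from the tool's source code.

```python
from typing import Optional

# Hypothetical, hand-written stand-in for a cross-language concept mapping:
# each concept maps language codes to the article title in that language.
SITELINKS = {
    "Q_CAKE": {"es": "Tarta", "ca": "Pastís", "en": "Cake"},
    "Q_MUFFIN": {"es": "Muffin", "en": "Muffin"},
}


def adapt_link(title: str, source_lang: str, target_lang: str,
               sitelinks: dict = SITELINKS) -> Optional[str]:
    """Return the target-language title for a source-language link.

    Returns None when the concept is unknown or has no article in the
    target language, so the caller can leave the text unlinked.
    """
    for titles in sitelinks.values():
        if titles.get(source_lang) == title:
            return titles.get(target_lang)
    return None


if __name__ == "__main__":
    print(adapt_link("Tarta", "es", "ca"))   # an article exists in Catalan
    print(adapt_link("Muffin", "es", "ca"))  # no Catalan article in the table
```

Returning `None` rather than guessing a title mirrors the behaviour described in the text: links point to the right articles only if those articles exist.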

Quality is key 

One of the concerns raised early by the participants was about MT quality. Users were concerned about the potential proliferation of low-quality content in Wikipedia articles.

In order to respond to that concern, CX keeps track of the amount of text that is added from MT without further modification by the users. When a given threshold is exceeded, a warning is shown to users encouraging them to focus on quality more than quantity.
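One possible way to track unmodified MT text is sketched below. It is an illustration under stated assumptions, not the tool's actual implementation: the character-level comparison with `difflib`, the function names and the 0.75 threshold are all hypothetical choices.

```python
from difflib import SequenceMatcher


def unmodified_mt_fraction(pairs):
    """Estimate the share of the translation that is unedited MT output.

    `pairs` is a list of (mt_text, final_text) tuples, one per paragraph;
    mt_text is an empty string for paragraphs that did not start from MT.
    """
    total = 0
    unmodified = 0
    for mt_text, final_text in pairs:
        total += len(final_text)
        if mt_text:
            matcher = SequenceMatcher(None, mt_text, final_text, autojunk=False)
            # Characters that survive unchanged from the MT output.
            unmodified += sum(b.size for b in matcher.get_matching_blocks())
    return unmodified / total if total else 0.0


def should_warn(pairs, threshold=0.75):
    """Return True when the unedited MT share exceeds the (assumed) threshold."""
    return unmodified_mt_fraction(pairs) > threshold
```

A paragraph published exactly as the MT service produced it contributes all of its characters to the unmodified count, while a paragraph written from scratch contributes none.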

4 WYSIWYG implementation 

MediaWiki's wikitext is not standardised. For a long time, the only way to use wikitext was to render it to HTML with MediaWiki. Parsoid⁵ is a Wikimedia project that implements a second parser for wikitext. To follow the "focus on the translation" principle, we only provide limited editing and formatting options and side-step much of the complexity of Wikipedia article structure without negatively affecting the translation process. CX is the first translation tool that provides a WYSIWYG editor using the annotated HTML provided by Parsoid.

⁵ https://www.mediawiki.org/wiki/Parsoid

Some MT services neither support HTML input nor provide reordering information. Preserving markup is an essential requirement for CX because wikitext adaptation and WYSIWYG editing are based on the markup. We devised an algorithm that can reconstruct the reordering information by making the MT service do some additional work.
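The paper does not spell out the algorithm, so the following is only a sketch of one plausible strategy, not necessarily the one used in CX: translate the plain sentence, translate each annotated span separately (the "additional work"), and re-apply the markup wherever the span's translation is found in the translated sentence. The toy lookup table `TOY_MT`, the span format and the function names are hypothetical, and spans are assumed not to overlap.

```python
# Hypothetical toy MT service: a lookup table instead of a real engine.
TOY_MT = {
    "a red car": "un coche rojo",
    "red": "rojo",
    "a": "un",
}


def toy_translate(text):
    """Stand-in for a plain-text MT service without markup support."""
    return TOY_MT.get(text, text)


def translate_with_markup(text, spans, translate=toy_translate):
    """Translate plain text and re-apply inline markup to the result.

    `spans` is a list of (start, end, tag) offsets into the source text.
    Each span is translated on its own and then located in the translated
    sentence; spans that cannot be located are dropped.
    """
    translated = translate(text)
    found = []
    for start, end, tag in spans:
        piece = translate(text[start:end])
        pos = translated.find(piece)
        if pos != -1:
            found.append((pos, pos + len(piece), tag))
    # Insert tags from right to left so earlier offsets stay valid.
    for start, end, tag in sorted(found, reverse=True):
        translated = (translated[:start] + f"<{tag}>" + translated[start:end]
                      + f"</{tag}>" + translated[end:])
    return translated
```

In the toy example, bold markup on "red" in "a red car" ends up on "rojo" at the end of "un coche rojo", showing how the markup follows the word even though the MT output reorders it.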


5 MT evaluation 


We use subjective evaluations of MT quality for a given language pair to decide whether to include an MT service for that pair in the tool. To evaluate MT for a given language pair, we ask potential future users of the tool to translate articles using it and tell us whether or not it was useful to them.


We run an MT service on our servers using the open-source Apertium project (Forcada et al., 2011), but we support other MT providers as well.















6 Evaluation 


Currently the tool is only available to self-selected users (most of them experienced editors), hence the results cannot be generalised to the whole community. Further studies of the resulting quality over the long term will help.

The low deletion ratio for articles created using CX suggests that there are no major problems in terms of quality. In three months of exposing the tool as an opt-in feature, 900 articles were published using CX with an overall deletion ratio lower than 1% across all languages, which is lower than the deletion rate for all new articles.

We noticed that there is a significant difference between the number of created articles in different target language Wikipedias, which cannot be explained by the number of active users, the number of available articles to translate or the availability of MT. For example, in three months the Catalan Wikipedia saw 455 articles created by translating from Spanish with CX, but the Portuguese Wikipedia only 25. Both language pairs have MT provided by Apertium. Statistics about the tool are collected publicly.

We have not yet made precise measurements of translation time savings, but we got positive reports from our users. In a roundtable organised with editors of the Catalan Wikipedia, an experienced editor reported a 70% time saving.

We found that English is the most used source language, consistent with Hale's findings on multilingual user behaviour (2013).


also shows that the quality of the published translations is good, alleviating the community's concerns. The low translation activity in multiple languages where the tool is already available needs further research. Close integration in Wikipedia allows CX to recruit users and suggest articles to translate in ways not possible with the previous tools.